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Abstract 



This article introduces the idea that information compression by mul- 
tiple alignment, unification and search (ICMAUS) provides a framework 
within which natural language syntax may be represented in a simple for- 
mat and the parsing and production of natural language may be performed 
in a transparent manner. 

' In this context, multiple alignment has a meaning which is similar 

, to its meaning in bio-informatics but with significant differences, while 

unification means a simple merging of matching patterns, a meaning which 
""^ | is related to but simpler than the meaning of that term in logic. The 

, concept of search in the present context means search for alignments which 

are 'good' in terms of information compression, using heuristic methods 
or arbitrary constraints (or both) to restrict the size of the search space. 
. These concepts are embodied in a software model, SP61. The organi- 

S_t ' sation and operation of the model are described and a simple example is 

presented showing how the model can achieve parsing of natural language. 

Notwithstanding the apparent paradox of 'decompression by compres- 
sion', the ICMAUS framework, without any modification, can produce a 
sentence by decoding a compressed code for the sentence. This is illus- 
trated with output from the SP61 model. 

The article includes four other examples - one of the parsing of a 
sentence in French and three from the domain of English auxiliary verbs. 
These examples show how the ICMAUS framework and the SP61 model 
can accommodate 'context sensitive' features of syntax in a relatively 
simple and direct manner. 

An important motivation for this research is the possibility of develop- 
ing the ICMAUS framework as a unifying framework for diverse aspects 
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of computing in addition to those described in this article. Other as- 
pects which appear to fall within the scope of the ICMAUS framework 
but which are outside the scope of this article, include the representa- 
tion of natural language semantics, best-match pattern recognition and 
information retrieval, deductive and probabilistic reasoning, planning and 
problem solving, and unsupervised inductive learning. 

Key Words: natural language; syntax; parsing; production; multiple align- 
ment; unification; information compression; MML; MDL. 
Category: 1.2.7 

1 Introduction 

This article introduces the idea that information compression (IC) by multiple 
alignment, unification and search (ICMAUS) provides a framework within which 
natural language syntax may be represented in a simple format and the parsing 
and production of natural language may be performed in a transparent manner. 

In this context, multiple alignment has a meaning which is similar to its 
meaning in bio-informatics but with significant differences, while unification 
means a simple merging of matching patterns, a meaning which is related to 
but simpler than the meaning of that term in logic. In the present context, 
search means search for alignments amongst patterns which are 'good' in terms 
of information compression, with constraints to reduce the size of the search 
space as described in Section |2 

The mechanisms for heuristic search which are incorporated in the ICMAUS 
framework as it has been developed here allow syntactic knowledge to be ex- 
pressed as patterns (as described in Section 13. 1|) , a mode of expression which is 
significantly different from existing formalism. 1 The use of patterns (in the sense 
of this article), has potential advantages compared with existing formalisms. 

These concepts are embodied in a software model, SP61. This article de- 
scribes the organisation and operation of the model with examples of what the 
model can do. 

1.1 Novelty of proposals 

Aspects of these proposals which appear to be novel are described in the follow- 
ing subsections. 

1 In this research, the term pattern has been adopted as a general term which means an 
array of symbols of one, two or more dimensions. Notwithstanding the fact that this research 
has so far been largely restricted to one-dimensional sequences of symbols, the term pattern is 
generally used in preference to the term sequence as a reminder of the intention, later in the 
research programme, to generalise the concepts to two or more dimensions. Formal definitions 
of terms like symbol and pattern as they are used in this research are given in Appendix 1X1 
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1.1.1 Parsing (with choices at many levels) as multiple alignment 

The most novel feature of the present proposals appears to be the idea that 
parsing, in the sense understood in theoretical and computational linguistics 
and natural language processing, may be understood as multiple alignment. 

A concept of parsing is already well-established in the literature on data 
compression (see, for example, |Storer 88] *). In that context, it means a process 
of analysing data into segments, each of which is replaced by a relatively short 
'code' associated with the given segment in a 'dictionary' of segments. 

But this kind of parsing is simpler than 'linguistic' kinds of parsing. In the 
first case, although segments may have internal hierarchical structure, alterna- 
tives can be chosen only at one level. In the second kind of parsing, which is the 
focus of interest in this article, there may be alternatives at arbitrarily many 
levels in the grammar which is used to guide the parsing. 



1.1.2 Parsing as information compression 

Research on parsing and related topics within computational linguistics and 
AI does not normally consider these topics in terms of information com- 
pression (IC) (but see, for example, |Berger et al. 96| IHu et al. 97\ ). How- 
ever, there is a well-developed tradition of parsing and linguistic analysis in 
terms of probabilities, with associated concepts such as 'stochastic grammars', 
'maximum-likelihood', 'Bayesian inference' and 'statistical analysis' (see, for 



example, 


Abney 971 IBlack et al. 931 


IDreuth and Ruber 971 IGarside et al. 871 


|Takahashi and Sagayama 97| 


IWu 971 
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Lucke 95|) and there is a close connection 



1.1.3 Production of language as ICMAUS 

Essentially the same points as were made above about parsing apply also to the 
production of language. It is interesting that the ICMAUS framework, without 
any modification, lends itself to the production of language as well as it does to 
parsing (see Section IHJl. 



1.1.4 Representing syntax with patterns 

As we shall see, the ICMAUS framework allows natural language syntax to be 
represented with patterns in a manner which is significantly different from other 
formalisms. 

2 Measures of frequency or probability have a key role in techniques for economical 
coding such as the Huffman method or the Shannon-Fano-Elias (S-F-E) method (see 
ICover a nd Thomas 91.). Conversely, measures of compression may be translated into mea- 
sures of probability (see I Wolff 99bl 1. 
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1.2 Background and context 

The proposals in this article have been developed within a programme of re- 
search developing the 'SP' conjecture 3 that all kinds of computing and formal 
reasoning may usefully be understood as information compression by pattern 
matching, unification and search, and developing a 'new generation' computing 
system based on this thinking ( [Wolff 90| to |Wolff 00) 1. 

This entire programme of research is based on an earlier programme of re- 
search into unsupervised learning of language structures (see [Wolff 88llWo lff 82 
and earlier articles cited there) . That research and the present research are based 
on principles of Minimum Length Encoding (MLE). Relevant sources are cited 
in Section [Ql 

The overall aim of this research programme is the integration and sim- 
plification of concepts in computing and cognition. Besides the aspects of 
natural language processing considered in this article, the ICMAUS frame- 
work appears to have potential to accommodate several other aspects of com- 
puting and cognition, including unsupervised learning [Wolff 96) . the repre- 
sentation of non-linguistic 'semantic' structures (examples may be found in 
Wolff 99 E)lWolff 98c)), m athematics and logic [Wolff 00) . pro babilistic reason- 
ing [Wolff 99EI IWoltf 96) . best-match information retrieval ( [Wolff 94a) ) and 
best-match pattern recognition ( Wolff 95a ). It can be argued ( Wolff 99a ) 
that the ICMAUS framework provides an interpretation for the organisation 
and operation of any Universal Turing Machine (UTM), and equivalent models 
of 'computing' such as the Post Canonical System (PCS). 

1.3 Related research 

As an attempt to integrate concepts across several areas of computing, the SP 
programme naturally has many connections with other research in the several 
areas that it seeks to integrate. Some connections are described in Wolff 911 
IWolff 93llWolff 95b| . 

In terms of theoretical foundations, the closest links are with work on 
Algorithmic Information Theory (AIT, see, for example, |Li and Vitanyi 93"! ) 
and Minimum Length Encoding (MLE, see, for example, 
[Solom onoff 64 , Wallac eand Boulton 681 IRissanen 781 

Bcll oti and Gammerman "961 ICammerman 91\ ) which is itself closely related to 
Bayesian inference (see, for example, [Cheeseman 901 [Pcdnaul t"9T) 1. 

1.3.1 Distinctive features of the SP programme compared with MLE 
and AIT 

Although the SP theory is based on MLE principles, there are important dif- 
ferences in objectives and orientation between the SP programme and other 

3 Information compression may be interpreted as a process of maximising Simplicity in in- 
formation (by reducing redundancy) whilst retaining as much as possible of its non-redundant 
descriptive Power. Hence the sobriquet 'SP' which has been applied to these ideas. 
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research in MLE and AIT. These differences are described in Section 3.6 of 
[Wolff 95 b| and Sections 7.1 and 7.2 of |Wolff 94b| L In brief, the main differ- 
ences are: 



• The SP programme seeks to integrate all kinds of computing and formal 
reasoning within a framework of information compression. This goal is 
broader than it is in other research in AIT or MLE. 

• The SP programme is based on the hypothesis that all kinds of infor- 
mation compression may be understood in terms of multiple alignment, 
unification and search. In essence this means the hypothesis that all kinds 
of information compression is achieved by the unification of matching pat- 
terns. All existing and projected SP models are restricted to ICMAUS 
mechanisms and avoid 'arithmetic coding' and other mathematical tech- 
niques which are used for information compression. 

The restriction has been imposed in the interests of simplicity in the SP 
theory. The aim is to build a theoretical framework from a 'bedrock' of 
apparently primitive operations of matching symbols and patterns and 
unifying symbols and patterns. The theory should avoid including any 
concepts that cannot be derived from this foundation. 

If, as conjectured, arithmetic and, perhaps, mathematics, may be under- 
stood in terms of ICMAUS (see [Wolff QQ| ), then compression techniques 
that use mathematical concepts may also be understood in terms of IC- 
MAUS. But until this has been demonstrated, the adoption of arithmetic 
coding or any other mathematical technique would add unwanted com- 
plexity to the SP model. 

• In the first point above, the phrase "all kinds of computing" includes the 
concept of 'computing' itself in its full depth and generality. Thus the SP 
programme hypothesises that other models of computing such the Turing 
model or the Post Canonical System may be understood in terms of the 
ICMAUS concepts (see [Wolff 99a] ). By contrast, researchers in AIT and 
MLE accept the Turing model (and equivalent models) as the foundation 
of concepts in computing. 

1.4 Scope of this article 

This paper describes new concepts in the representation of syntax and in the 
parsing and production of language. It does not describe a complete working 
system, including a complete grammar of one or more languages, and it should 
not be evaluated as such. 

In the space available, it is possible only to present these proposals in outline. 
It has been necessary to omit many details and there are many associated issues 
which could not be discussed. 

Although integration of concepts right across the field of computing provides 
the main motivation for this programme of research (as noted in Section 11.21) , 
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topics other than syntax, parsing and production of natural language will not 
be considered except briefly where they are relevant. 

1.5 Presentation 

In what follows, I have tried to bring the important ideas into relief by describ- 
ing them relatively briefly in the body of the article and moving details into 
appendices. The main sections after this one are these: 

[21 introduces multiple alignment problems in general terms and describes 
how the concept has been generalised in this programme of research. 

121 describes the use of patterns (defined in Appendix ^| to represent the 
broad features of the syntax of natural language and describes how the 
parsing of language may be seen as multiple alignment. 

0] describes in outline how an alignment may be evaluated in terms of IC. 

describes the main features of the SP61 model, a partial realisation of the 
ICMAUS framework, running on a conventional computer. 

El describes how the production of language may be seen in terms of multiple 
alignment - between a 'coded' representation of a sentence and rules in a 
grammar. 

[3 presents a selection of other examples showing how the ICMAUS frame- 
work can accommodate 'context sensitive' features of syntax. 

00 discusses briefly some associated issues and makes some concluding re- 
marks. 

The appendices are as follows: 
lAl provides formal definitions of the main terms used in this article. 

IbI supplements Section 0] with a more detailed account of the method for 
evaluating the IC associated with any alignment which is used in the 
SP61 model. 

[HI describes the organisation of the SP61 model in more detail than Section 

E 

Generally speaking, small examples have been used in this article for the 
sake of clarity and to save space. It should not be assumed that the examples 
represent the limits of what the system can do (see Section I5T2|) . 
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2 Multiple alignment problems 



Multiple alignment is a term borrowed from bio-informatics where it means 
the arrangement of two or more sequences of symbols in horizontal rows one 
above the other so that, by judicious 'stretching' of sequences where necessary, 
symbols that match each other from one sequence to another can be brought 
into alignment in vertical columns. 

A 'good' alignment is, in general, one where there is a relatively large number 
of hits (positive matches) between symbols and where any gaps (sequences of 
unmatched symbols) between hits are relatively few and relatively short. The 
meaning of 'good' in this context is described in Section 0] and Appendix iBl 

Multiple alignments like these are normally used in the computational anal- 
ysis of (symbolic representations of) sequences of DNA bases or sequences of 
amino acid residues as part of the process of elucidating the structure, func- 
tions or evolution of the corresponding molecules. An example of an alignment 
of DNA sequences is shown in Figure Q 
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Figure 1: A 'good' alignment amongst five DNA sequences. 



2.1 Search and the need for constraints 

In this area of research, it is widely recognised that, with the exception of 
alignments of patterns which are very small and very few, the number of possible 
alignments of symbols is too large to be searched exhaustively. For any set of 
patterns of realistic size, a search which has acceptable speed and acceptable 
scaling properties can only be achieved if some kind of constraint is used: 

• Arbitrary parts of the search space may be excluded a priori. For example, 
in multiple alignment problems, an upper limit may be set to the size of 
any 'gap' between 'hits', as described above. 

• With 'heuristic' techniques, searching is done in stages, with a progres- 
sive narrowing of the search space in successive stages using some kind of 
measure of 'goodness' to guide the search. Heuristic techniques include 
'hill climbing' (sometimes called 'descent'), 'beam search', 'genetic algo- 
rithms', 'simulated annealing', 'dynamic programming' and others. These 
techniques may be described generically as 'metrics-guided search'. 
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Either or both of these kinds of constraint may be applied. Given one or 
both of these kinds of constraint, it is not possible to guarantee that, for any set 
of patterns, the best possible alignment has been found. For many tasks, this 
guarantee is not necessary and it is sufficient to find alignments that are "good 
enough" . 

There is now a fairly large literature about methods for finding good align- 
ments amongst two or more sequences of symbols. All of them use constraints 
of one kind or another and, for that reason, none of them can guarantee that the 
best possible result is always found. Some of the existing methods are reviewed 
in |Taylor 88| IBarton 901 IChan et a/.~92l |Day and McMorris 92| . For reasons 
which will be explained in the next section, none of the current methods seem 
to be entirely suitable for incorporation in the proposed SP system. 

2.2 Development of the concept of multiple alignment in 
the present research 

In this research, concepts associated with multiple alignment and the multiple 
alignment concept itself have been adapted and developed in the following way: 

• One (or more) of the patterns of symbols to be aligned has a special status 
and is designated as 'New'. In the context of parsing, this would be the 
sentence (or other sequence of symbols) which is to be parsed. 

• All other patterns are designated as 'Old'. In the context of parsing, 
this would be the patterns of symbols which represent grammatical 'rules' 
(more about this in Section [21 below). 

• A 'good' alignment is one which, through the unification of symbols in 
New with symbols in Old, and through unifications amongst the symbols 
in Old, leads to a relatively large amount of compression of New in terms 
of the sequences in Old. How this may be done is explained in outline in 
Section 0] and in more detail in Appendix iBl 

• By contrast with 'multiple alignment' as normally understood in bio- 
informatics, any given sequence in Old may appear two or more times 
in any one alignment and, in these cases, it is possible for the given se- 
quence to be aligned with itself. 

• As noted already, it is envisaged that, at some point in the future, the 
concept of multiple alignment as it is understood here will be generalised to 
alignments of patterns with two dimensions or higher (diagrams, pictures 
and so on). 

Notice that two or more appearances of a pattern in an alignment are re- 
peated appearances of a single entity in the alignment - and this is not the same 
as having two or more copies of a given pattern in an alignment. In the latter 
case, it is permissible to form a hit between a given symbol in one copy of a 
pattern and the corresponding symbol in another copy. In the case of two or 
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more appearances of a pattern in an alignment, it is not permissible to form a hit 
between a symbol in one appearance and the corresponding symbol in another 
appearance - because this would mean forming a hit between one symbol and 
itself. 

3 Syntax as 'patterns' and parsing as multiple 
alignment 

This section describes how the simpler aspects of syntax may be represented 
with patterns and how the parsing of a sentence in terms of a grammar may be 
seen in terms of multiple alignment. 

The example considered in this section and again in Section may give 
the impression that the ICMAUS framework is merely a trivial variation of 
familiar concepts of context-free phrase-structure grammar (CF-PSG) with their 
well-known inadequacies for representing and analysing the 'context sensitive' 
structures found in natural languages. The examples presented in Section 
show that the ICMAUS framework is much more 'powerful' than CF-PSGs and 
can accommodate quite subtle context-sensitive features of natural language 
syntax in a simple and elegant manner. 

3.1 Representing a grammar with patterns of symbols 

Figure[3shows a simple CF-PSG describing a fragment of the syntax of English. 
This grammar generates sentences like 'thisboylovesthatgirl', 't 
hatboyhatesthisgirl', and so on. Any of these sentences may be 
parsed in terms of the grammar giving a labelled bracketing like this: 

(S(NP(D t h i s)(N b o y))(V 1 o v e s) 
(NP(D t h a t)(N g i r 1))) 

or an equivalent representation in the form of a tree. 

Figure shows the grammar from Figure [21 expressed as a set of strings, 
sequences or patterns of symbols (as defined in Appendix ^) . Each pattern in 
this 'grammar' is like a re-write rule in the CF-PSG notation except that the 
rewrite arrow has been removed, some other symbols have been introduced ('0', 
'1' and symbols with an initial '#' character) and there is a number to the right 
of each rule. 

The number to the right of each rule in Figure |3 is a frequency of occurrence 
of the rule in a ('good') parsing of a notional sample of the language. These 
frequencies have a role in determining the IC associated with any alignment but 
their main significance (considered in [Wolff 99"E] and outsid the scope of this 
article) is in determining probabilities associated with any given alignment. 

The reasons for the symbols which have been added to each rule will become 
clear but a few words of explanation are in order here. The symbols '0' and '1' 
have been introduced to differentiate the two versions of the 'D' patterns, and 
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S -> NP V NP 
NP -> D N 
D -> t h i s 
D -> t h a t 
N -> g i r 1 
N -> b o y 

V -> 1 o v e s 

V -> h a t e s 

Figure 2: A CF-PSG describing a fragment of English syntax. 

likewise for the 'N' patterns and 'V patterns. They enter into matching and 
unification in exactly the same way as other symbols. Although the symbols are 
the same as are used in other contexts to represent numbers they do not have 
the meaning of numbers in this grammar. 

S NP #NP V #V NP #NP #S (500) 
NP D #D N #N #NP (1000) 
D t h i s #D (600) 
D 1 t h a t #D (400) 
N g i r 1 #N (300) 
N 1 b o y #N (700) 
V01oves#V (650) 
Vlhates#V (350) 

Figure 3: The grammar from Figure |21 recast as patterns of symbols. 

The symbols which begin with '#' (e.g., '#S', '#NP') serve as 'termination 
markers' for patterns in the grammar. Although their informal description 
as 'termination markers' suggests that these symbols are meta symbols with 
special meaning, they have no hidden meaning and they enter into matching 
and unification like every other symbol. 

In general, all the symbols that can be seen in Figure OH and other examples 
in this article are simply 'marks' that can be discriminated from each other 
by yes/no matches but otherwise have no intrinsic meaning. Although some of 
these symbols can be seen to serve a distinctive role, there is no hidden meaning 
attached to any of them and no formal distinction between upper- and lower- 
case letters or between digit symbols and alphabetic symbols and so on (see 
Appendix |A"|) . 

3.2 Parsing as alignment of a sentence and rules in a gram- 
mar 

Figure 01 shows how a parsing of the sentence 'thisboylovesthatg 
i r 1' may be seen as an alignment of patterns which includes the sentence and 
relevant rules from the grammar shown in Figure [3] The similarity between 
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this alignment and the conventional parsing may be seen if the symbols in the 
alignment are 'projected' on to a single sequence, thus: 

SNPDOthis#DNlboy#D #NP V01oves#V 
NPDlthat#DNOgirl#N #NP #S 

In this projection, the two instances of 'NP' in the second column of the 
alignment have been merged or 'unified' and likewise for the two instances of 
'D' in the third column and so on wherever there are two or more instances of 
a symbol in any column. 

This projection is the same as the conventional parsing except that '0' and 
'1' symbols are included, right bracket symbols (')') are replaced by 'termination 
markers' and each of the upper-case symbols is regarded both as a 'label' for a 
structure and as a left bracket for that structure. 

Notice that the pattern 'NP D #D N #N #NP' appears twice in the align- 
ment in Figure 01 in accordance with what was said in Section 12^1 In general, 
any pattern in the grammar used for parsing may appear two or more times in 
an alignment. Other examples will be seen later. 

As was noted in Section l2~2l the sentence or other sequence of symbols to be 
parsed is regarded as New, while the rules in the grammar are regarded as Old. 
For the sake of readability and ease of interpretation, New is normally placed 
at the top of each alignment with patterns from Old below it. 

For the sake of clarity in Figure^land other alignments shown in this article, 
each appearance of a pattern in any alignment is given a line to itself (so that 
the two appearances of 'NP D #D N #N #NP' in Figure^Jare on two different 
lines). Apart from the convention that New is always at the top, the order 
in which patterns appear (from top to bottom of the alignment) is entirely 
arbitrary. An alignment in which the patterns appear in one order is totally 
equivalent to an alignment in which they appear in any other order, provided 
all other aspects of the alignment are the same. 

All the examples of parsing by alignment shown in this article are output 
from the SP61 model and in every case, the alignment shown is the best align- 
ment (in terms of IC) that the model has found with the given sentence in New 
(in row 0) and the grammar identified in the caption in Old. 

4 Multiple alignments and information com- 
pression 

This section describes in broad terms how alignments are evaluated in terms 
of IC. A more detailed account of the method of evaluation used in the SP61 
model is given in Appendix FBI 

Although IC and related concepts of probability are well-established in the 
evaluation of alignments in bio-informatics (see, for example, 
[Reichert et al. 731 IFelsenstein 8lllAllison et al. 92[IChan et al. 921 
Allis on and Wallace 94HWolff 94a] ). the framework here is different (as 
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Figure 4: The best alignment found by SP61 with 'thisboylovesthatgir 
P in New and the grammar from Figure in Old. 
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described in Section which means that existing methods cannot be 
applied directly. 

In the present work, a good alignment is one which allows an economical 
coding of New in terms of the patterns in Old. The compression method ex- 
ploits the elementary principle that a (relatively long) sequential pattern which 
repeats two or more times in a body of information may be replaced by a shorter 
identifier, 'tag' or 'code' associated with that pattern in some kind of 'dictio- 
nary' of patterns. In effect, each instance of the pattern in the data is unified 
with the same pattern as it appears in the repository of patterns. This is the 
basis of all standard methods for IC (see [Storer 88] ). 

In the ICMAUS scheme, this principle can be applied at a single 'level', as 
in the majority of standard compression schemes, but it can also be applied at 
an arbitrary number of 'higher' levels. To see what this means, consider the 
alignment shown in Figure 

At the most basic level, a word like 'this' in New (the sentence being 
parsed) is matched by the pattern 'D t h i s #D' in Old (the grammar) which 
means that the symbols 'D #D' can be used as a 'code' for the pattern. 4 

A certain amount of compression can be achieved by encoding the words 
in the sentence being parsed at a single level. But more compression can be 
achieved by taking advantage of the fact that the words in the sentence are not 
a random sequence of words but they conform to grammatical patterns defined 
in the grammar like 'NP D #D N #N #NP' and 'S NP #NP V #V NP #NP 
#S'. The details of how this may be done are explained in Appendix iBl As 
indicated above, this kind of encoding at a 'higher' level can be applied through 
arbitrarily many levels, depending on the patterns of redundancy in New and 
in the language from which it comes. 

5 The SP61 model 

Given the example sentence discussed earlier (shown at the top of Figurc0J) and 
the grammar in Figure|3 the SP61 model can find the alignment shown in Figure 
01 and, in terms of compression, it identifies it as the best alignment amongst 
the several which it forms for the given sentence and the given grammar. Given 
relevant sentences and grammars, the model finds all the other alignments shown 
in this article (they are indeed taken directly from the output of the model) . In 
each case, the alignments shown are the best in terms of IC amongst alternative 
alignments that the model finds for a given sentence and grammar. 

It is interesting to see that, in general, alignments that are good in terms 
of IC are also 'correct' in terms of our linguistic intuitions. This relationship 
holds true for several other examples of parsing by the model. Space limitations 
prevents them being shown here but they can be found in Wolff 98b . 

4 Although the code 'D #D' does not appear to be much smaller than 'this' in New, 
a weighting factor ensures that the number of bits to be encoded is significantly larger than 
the code, as explained in Appendix IBl 
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5.1 How the model works 

The SP61 model works by building alignments in a pairwise fashion selecting 
the 'best' in terms of compression at each stage. The method thus constitutes 
a fairly straightforward application of 'metrics-guided' search: examine large 
search spaces in stages, narrowing the search progressively at each stage using 
some kind of 'search metric' to guide the search. This accords with the need 
for constraints in searching what is normally an astronomically large space of 
possible alignments (Section |OJ. 

Alignments can be built up in a pairwise manner because, at every stage, 
new alignments are accepted only if they can 'project' into a one-dimensional 
pattern as described in Section 1X^1 Since any such alignment can be treated as 
a single sequence of symbols it is possible to match it against any of the original 
patterns in the grammar or any of the alignments formed at earlier stages. 

The program starts by searching for 'good' alignments between the sentence 
to be parsed and patterns in the grammar. For the example in Figure 01 the 
best alignments found at this stage are between the individual words in the 
sentence and corresponding patterns in the grammar. 

At the next stage, the program looks for 'good' alignments between the best 
of the alignments previously found and patterns in the grammar. The 'best' 
alignments at this stage are ones between the alignments corresponding to the 
words and 'higher level' patterns in the grammar. Thus 'D t h i s #D' and 'N 
1 b o y #N' form an alignment with 'NP D #D N #N #NP', giving 'NP D t 
h i s #D N b o y #D #NP'; likewise, 'V 1 o v e s #V forms an alignment 
with 'S NP #NP V #V NP #NP #S' giving 'S NP #NP V 1 o v e s #V NP 
#NP #S'; then 'D 1 t h a t #D' and 'N g i r 1 #N' form an alignment with 
'NP D #D N #N #NP' giving 'NP D 1 t h a t #D N g i r 1 #N #NP'. 

Finally, 'NP D t h i s #D N 1 b o y #D #NP' and 'NPD 1 t hat #DN 
1 g i r 1 #N #NP' are aligned with 'S NP #NP V 1 o v e s #V NP #NP #S' 
giving the result shown in Figure^ At each stage, many 'worse' alignments are 
formed which are weeded out by the selection process. 

An outline of how the model works is shown as pseudocode in Appendix [U] 
together with explanatory text. 

5.2 Computational complexity 

Given the well-known computational demands of multiple alignment problems, 
readers may reasonably ask whether the proposed framework for parsing would 
scale up to handle realistically large grammars and longer sentences. 

Estimates of the time complexity and space complexity of the model are 
given here largely without justification owing to shortage of space. In a serial 
processing environment, the time complexity of the model has been estimated 
Wolff 98c to be approximately 0(log2n x nm), where n is the length of the 
sentence (in bits) and m is the sum of the lengths of the patterns in the gram- 
mar (in bits). In a parallel processing environment, the time complexity may 
approach 0(log2n x n), depending on how the parallel processing is applied. In 
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serial and parallel environments, the space complexity should be O(m). 

These estimates are based on the assumption that any given sentence is 
processed as a single entity. However, the program has been designed so that 
it is possible to process any given sentence as a succession of 'windows' (see 
Appendix IC.4|) . Since it is possible to discard all but the best intermediate 
results at the end of each window, the time complexity of the model in a serial 
environment and operating in 'windows' mode appears to be approximately 
O(rom). The time complexity of the program in 'windows' mode in a parallel 
environment depends on exactly how the parallelism is applied but, in general, 
it is likely to be better than in a serial environment. 

6 Decoding by compression: the production of 
language 

As described in Appendix iBl a succinct, coded representation of a sentence may 
be derived from a 'good' alignment amongst a set of sequences which includes 
the sentence and rules in an appropriate grammar. This section proposes an idea 
which at first sight may seem contradictory or paradoxical: that the decoding 
of a coded representation of a sentence may be achieved by precisely the same 
process of compression (by multiple alignment, unification and search) as was 
used to achieve the original encoding! Although this may superficially appear 
to be nonsense, careful reading of this section should convince readers that the 
proposal is sound and that no laws of logic or mathematics have been violated. 

In this reversal of the original process of encoding, a sentence may be cre- 
ated by finding a 'good' alignment amongst a set of patterns that includes a 
pattern that encodes the sentence (in New) together with rules in the grammar 
which were used to create the encoding (in Old). In both cases (encoding and 
decoding), alignments may be evaluated in terms of the potential compression 
of one sequence: the sentence in the first case and the encoded representation 
of the sentence in the second case. 

Figure shows an alignment of this kind produced by the SP61 model. At 
the top of the figure is the sequence 'S 1 1 #S' which is the encoded version 
of'thisboylovesthatgirl', as described in Appendix IBl The other 
sequences in the figure are rules from the grammar shown in Figure 

As with parsing (Section QJl, an alignment may be interpreted by projecting 
its constituent symbols into a single sequence. In the case of the alignment 
in Figure [3 the result of this projection is exactly the same as was shown in 
Section f5~21 Although this sequence contains grammatical symbols other than 
words, it has the right words in the right order and may thus be regarded as 
a realisation of the sentence corresponding to the coded sequence 'S 1 1 
#S'. 
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Figure 5: The best alignment found by SP61 with 'S 1 1 #S' in New and the 
grammar from Figure E]in Old. 
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6.1 Decompression by compression 

The alignment shown in Figure achieves the paradoxical effect of 'decompres- 
sion by compression' because the 'input' (in New) is a compressed code for a 
sentence and the 'output' is an alignment whose unification contains the original 
uncompressed sentence (together with 'service' symbols like 'S', 'NP' etc). 

How can this paradox be resolved and how is it possible to achieve compres- 
sion with something (the code for the sentence) which is already compressed? 

This is not as mysterious as it may at first sight seem. The answer to 
the riddle is the provision of two distinct sizes for each symbol, as described in 
Appendices IB . II and lB . 7l The minimum size (in bits) is the theoretical minimum 
calculated according to the S-F-E method, while the actual size (in bits), which 
is the real size of the symbol in a practical system, is larger than the minimum 
size by some constant factor. 

In the calculation of the compression difference (CD) for each alignment 
(described in Appendix IB. 7(1 . the actual sizes of symbols are used to compute 
-B/v, the number of bits required to represent, in 'raw' form, the symbols from 
New that enter into the alignment. But the minimum sizes of symbols are used 
to compute Be, the number of bits required to encode the alignment. Thus the 
CD which is derived from Bn and Be represents the maximum compression 
which is theoretically possible (with the given alignment within the ICMAUS 
framework) . 

Given the distinction between a theoretical minimum size for each symbol 
and a larger actual size, and given this way of calculating CD, the alignment 
method that was used for the original parsing can be used, without any mod- 
ification, to find the best alignment for the code for the sentence (in terms of 
CD values) and to discriminate it from the many 'wrong' alignments that are 
possible. 

The foregoing remarks reflect what appears to be a general truth about IC: 
if lossless compression of a body of information is required (so that the original 
form of the information can be reconstituted) then it seems that the encoded 
form of the information must always contain some residual redundancy. The 
existence of this residual redundancy may not always be obvious but it seems 
that decoding is not possible without it. 

7 Context sensitive aspects of syntax 

The examples considered so far may have given the impression that the ICMAUS 
framework is merely a trivial variation on CF-PSG. This section presents align- 
ments from two areas of syntax showing how the ICMAUS framework as it 
is realised in the SP61 model may accommodate 'context sensitive' aspects of 
syntax. 
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7.1 Syntactic dependencies in French 

It often happens in natural languages that there are syntactic dependencies 
between one part of a sentence and another. For example, there is usually a 
'number' dependency between the subject of a sentence and the main verb of 
the sentence: if the subject has a singular form then the main verb must have 
a singular form and likewise for plural forms of subject and main verb. 

A prominent feature of these kinds of dependency is that they are often 
'discontinuous' in the sense that the elements of the depency can be separated, 
one from the next, by arbitrarily large amounts of intervening structure. For 
example, the subject and main verb of a sentence must have the same number 
(singular or plural) regardless of the size of qualifying phrases or subortinate 
clauses that may come between them. 

Another interesting feature of syntactic dependencies is that one kind of 
dependency (e.g., number dependency) can overlap other kinds of dependency 
(e.g., gender (masculine/ feminine) dependency), as can be seen in the following 
example. 

In the French sentence Les plumes sont vertes ("The feathers are green") 
there are two sets of overlapping syntactic dependencies like this: 

P P P P Number dependencies 

Les plume s sont vert e s 

F F Gender dependencies 

In this example, there is a number dependency, which is plural ('P') in this 
case, between the subject of the sentence, the main verb and the following 
adjective: the subject is expressed with a plural determiner (Les) and a noun 
(plume) which is marked as plural with the suffix (s); the main verb (sont) has 
a plural form and the following adjective (vert) is marked as plural by the suffix 
(s). Cutting right across these number dependencies is the gender dependency, 
which is feminine ('F') in this case, between the feminine noun (plume) and the 
adjective (vert) which has a feminine suffix (e). 

For many years, linguists puzzled how these kinds of syntactic depen- 
dency could be represented succinctly in grammars for natural languages. 
But then elegant solutions were found in Transformational Grammar (TG, 
Chomsky 57|) and, lat er, in systems like Definite Clause Grammars (DCG, 
Pereira and Warren SOp . based on Prolog. 

The solution proposed here is different from any established system and is 
arguably simpler and more transparent than other systems. It will be described 
and illustrated with a fragment of the grammar of French which can generate 
the example sentence just shown. This fragment of French grammar, shown in 
Figure El is expressed with 'patterns' in the same manner as the grammar in 
Figure |3 and others in this article. 

Apart from the use of patterns as the medium of expression, this grammar 
differs from systems like TG or DCGs because the parts of the grammar which 
express the forms of 'high level' structures like sentences, noun phrases and verb 
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S NP #NP VP #VP #S (500) 

NP D #D N #N #NP (700) 

VP V #V A #A #VP (300) 

VP 1 V #V P #P NP #NP #VP (200) 

P sur #P (50) 

P 1 sous #P (150) 

V SNG est #V (250) 

V PL sont #V (250) 

D SNG M le #D (90) 

D SNG M 1 im #D (120) 

D SNG F la #D (130) 

D SNG F 1 une #D (110) 

D PL les #D (125) 

D PL 1 des #D (125) 

N NR #NR NS1 #NS1 #N (450) 

NS1 SNG - #NS1 (250) 

NS1 PL s #NS1 (200) 

NR M papier #NR (300) 

NR F plume #NR (400) 

A A AR #AR AS1 #AS1 AS2 #AS2 #A (300) 

AS1 F e #AS1 (100) 

AS1 M - #AS1 (200) 

AS2 SNG - #AS2 (175) 

AS2 PL s #AS2 (125) 

AR noir #AR (100) 

AR 1 vert #AR (200) 

NP SNG SNG #NP (450) 

NP PL PL #NP (250) 

NP M M #NP (450) 

NP F F #NP (250) 

N SNG V SNG A SNG (250) 

N PL V PL A PL (250) 

N M V A M (300) 

N F V A F (400) 

Figure 6: A fragment of French grammar with patterns for number dependencies and 
gender dependencies. 
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phrases (represented by the first four patterns in Figure |5J do not contain any 
reference to number or gender. 

Instead, the grammar contains patterns like 'NP SNG SNG #NP' and 'N M 
V AM' (the last eight patterns in Figure EJ. The first of these says, in effect, 
that between the symbols 'NP' and '#NP' there are two structures marked 
as singular ('SNG'). In this simple grammar, there is no ambiguity about what 
those two structures are: they can only be a determiner ('D') followed by a noun 
('N'). In a more complex grammar, there would need to be disambiguating 
context to establish the 'correct' alignments of symbols. The second pattern 
says, in effect, that in a sentence which contains the (discontinuous) sequence 
of symbols 'N V A', the noun ('N') is masculine ('M') and the adjective ('A') is 
also masculine. 

7.1.1 An alignment 

The alignment in Figures and |H1 shows the best alignment found by SP61 with 
our example sentence in New and the grammar from Figure El in Old. 5 The 
main constituents of the sentence are marked in an appropriate manner and 
dependencies for number and gender are marked by patterns appearing in rows 
13, 14 and 15 of the alignment. 

7.1.2 Discussion 

Readers may wonder why, in the example just shown, the pattern 'NP PL 
PL #NP' is separate from the pattern 'N PL V PL A PL'. Why not simply 
merge them into something like 'NP PL N PL #NP V PL A PL'. The reason 
for separating the number dependencies in noun phrases ('NP') from the other 
number dependencies is that they do no always occur together. For example, 
noun phrases may be found within one of the two verb-phrase ('VP') patterns 
shown in Figure [S] (the fourth pattern in the grammar) and this context does 
not contain the 'N ... V ... A ...' pattern. 

Another question that may come to mind is what happens when there are 
one or more subordinate clauses between the subject of a sentence and the main 
verb of the sentence, and when there are verbs in the subordinate clauses. In 
the case of number dependencies between subject and main verb, how can the 
system distinguish between the main verb and one of the verbs in the subor- 
dinate clauses? There is insufficient space here for a full answer to this ques- 
tion. In brief, it seems that this kind of problem can be overcome by providing 
disambiguating context in the patterns that express number dependency (see 
[Wolff 98hp. 

5 By contrast with the alignments shown in Figures |I]and|2] the alignment in Figures 171 and 
ISI and all subsequent alignments in this article, were originally created with spaces between 
the letters in every word, as in Figures [3] and [5] However, for the sake of readability (as 
suggested by one of the referees) and to save space, the alignments have been prepared again 
with no spaces within words (except where suffixes need to be identified as distinct entities 
within the grammar). 
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Figure 7: The best alignment found by SP61 with 'les plume s sont vert e s' in New 
and the grammar from Figure|H]in Old (Part 1). 
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Figure 8: The best alignment found by SP61 with 'les plume s sont vert e s' in New 
and the grammar from Figure HJ in Old (Part 2). 
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These ideas are still relatively new and there is plenty of scope for further 
investigation and development. 

7.2 Dependencies in the syntax of English auxiliary verbs 

This subsection presents a grammar and examples showing how the syntax of 
English auxiliary verbs may be described in the ICMAUS framework. Before 
the grammar and examples are presented, the syntax of this part of English 
is described and alternative formalisms for describing the syntax are briefly 
discussed. 

In English, the syntax for main verbs and the 'auxiliary' verbs which may 
accompany them follows two quasi-independent patterns of constraint which 
interact in an interesting way. 

The primary pattern of constraint may be expressed with this sequence of 
symbols, 

M H B B V, 

which should be interpreted in the following way: 

• Each letter represents a category for a single word: 

— 'M' stands for 'modal' verbs like 'will', 'can', 'would' etc. 

— 'H' stands for one of the various forms of the verb 'to have'. 

— Each of the two instances of 'B' stands for one of the various forms 
of the verb 'to be'. 

— 'V stands for the main verb which can be any verb except a modal 
verb (except, arguably, when it occurs by itself). 

• The words occur in the order shown but any of the words may be omitted. 

• Questions of 'standard' form follow exactly the same pattern as statements 
except that the first verb, whatever it happens to be ('M', 'H', the first 
'B', the second 'B' or 'V'), precedes the subject noun phrase instead of 
following it. 

Here are two examples of the primary pattern with all of the words included: 

It will have been being washed 
M H H B V 

Will it have been being washed? 
M H H B V 

The secondary constraints are these: 
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• Apart from the modals, which always have the same form, the first verb 
in the sequence, whatever it happens to be ('H', the first 'B', the second 
'B' or 'V'), always has a 'finite' form (the form it would take if it were 
used by itself with the subject). 

• If an 'M' auxiliary verb is chosen, then whatever follows it ('H', first 'B', 
second 'B', or 'V') must have an 'infinitive' form (i.e., the 'standard' form 
of the verb as it occurs in the context 'to ...', but without the word 'to'). 

• If an 'H' auxiliary verb is chosen, then whatever follows it (the first 'B', the 
second 'B' or 'V') must have a past tense form such as 'been', 'seen', 'gone', 
'slept', 'wanted' etc. In Chomsky's Syntactic Structures |Chomsky 57| , 
these forms were characterised as en forms and the same convention has 
been adopted here. 

• If the first of the two 'B' auxiliary verbs is chosen, then whatever follows 
it (the second 'B' or 'V') must have an ing form, e.g., 'singing', 'eating', 
'having', 'being' etc. 

• If the second of the two 'B' auxiliary verbs is chosen, then whatever fol- 
lows it (only the main verb is possible now) must have a past tense form 
(marked with en as above) . 

• The constraints apply to questions in exactly the same way as they do to 
statements. 

Figure shows a selection of examples with the dependencies marked. 



7.2.1 Transformational grammar and English auxiliary verbs 

In Figure[5]it can be seen that in many cases but not all, the dependencies which 
have been described may be regarded as discontinuous because they connect one 
word in the sequence to the suffix of the following word thus bridging the stem 
of the following word. Three instances of this discontinuous kind of dependency 
can be seen in the first example in the figure. 

In Syntactic Structures, |Chomsky 57| showed that this kind of regularity in 
the syntax of English auxiliary verbs could be described using Transformational 
Grammar (TG). For each pair of symbols linked by a dependency ('M inf, 'H 
en', 'Bl ing', 'B2 en') the two symbols could be shown together in the 'deep 
structure' of a sentence and then moved into their proper position or modified 
in form (or both) using 'transformational rules'. 

This elegant demonstration argued persuasively in favour of TG compared 
with alternatives which were available at that time. However, later research 
has shown that the same kinds of regularities in the syntax of English auxiliary 
verbs can be described quite well without recourse to transformational rules, 
using Definite Clause Grammars (DCGs) or other systems which do not use 
that type of rule (see, for example, [Pereira and Warren 801 IGazdar 89| ). An 
example showing how English auxiliary verbs may be described using the DCG 
formalism may be found in jWolff 871 pp. 183-184]). 
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H en B2 en 

It will have been being washed 
M inf Bl ing V 

Bl ing 

Will he be talking? 
M inf V 

V 

They have finished 

H en 

fin 

Are they gone? 

B2 en 

fin V 

Bl ing 

Has he been working? 

H en V 

fin 

Figure 9: A selection of example sentences in English with markings of dependencies 
between the verbs. Key: M — modal, H = forms of the verb 'have', Bl — first instance 
of a form of the verb 'be', B2 = second instance of a form of the verb 'be', V = main 
verb, fin = a finite form, inf — an infinitive form, en = a past tense form, ing = a verb 
ending in 'ing'. 
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7.2.2 English auxiliary verbs in the ICMAUS framework 

Figures EH an d ^2 show an 'ICMAUS' grammar for English auxiliary verbs 
which exploits several of the ideas described earlier in this article. Figure IT21 
Figures ^| and O and Figure ^| show the best alignments in terms of IC 
for three different sentences produced by the SP61 model using this grammar. 
In the following paragraphs, aspects of the grammar and of the examples are 
described and discussed. 

S ST NP #NP XI #X1 XR #S (3000) 
S Q XI #Xl NP #NP XR #S (2000) 
NP SNG it #NP (4000) 
NP PL they #NP (1000) 

XI V M #V #Xl XR XH XB XB XV #S (1000) 

XI 1 XH FIN #XH #X1 XR XB XB XV #S (900) 

XI 2 XB1 FIN #XB1 #X1 XR XB XV #S (1900) 

XI 3 V FIN #V #X1 XR #S (900) 

XH V H #V #XH XB #S (200) 

XB XB1 #XB1 XB #S (300) 

XB XB1 #XB1 XV #S (300) 

XB1 V B #V #XB1 (500) 

XV V #V #S (5000) 

M INF (2000) 

H EN (2400) 

B XB ING (2000) 

B XV EN (700) 

SNG SNG (2500) 

PL PL (2500) 

Figure 10: A grammar for the syntax of English auxiliary verbs (Part 1). 



7.2.3 The primary constraints 

The first line in the grammar is a sentence pattern for a statement (marked 
with the symbol 'ST') and the second line is a sentence pattern for a question 
(marked with the symbol 'Q'). Apart from these markers, the only difference 
between the two patterns is that, in the statement pattern, the symbols 'XI 
#X1' follow the noun phrase symbols ('NP #NP'), whereas in the question 
pattern they precede the noun phrase symbols. As can be seen in the examples 
in Figure IT2l Figures IT51 and ITU and Figure IT51 the pair of symbols, 'XI #X1', 
has the effect of selecting the first verb in the sequence of auxiliary verbs and 
ensuring its correct position with respect to the noun phrase. In Figure 1121 it 
follows the noun phrase, while in Figures ITTA and ITU and Figure IT51 it precedes 
the noun phrase. 

Each of the next four patterns in the grammar have the form 'XI ... #X1 
XR ... #S'. The symbols 'XI' and '#X1' align with the same pair of symbols 
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V M will #V (2500) 

V M 1 would #V (1000) 

V M 2 could #V (500) 

V H INF have #V (600) 

V H PL FIN have #V (400) 

V H SNG FIN has #V (200) 

V H EN had #V (500) 

V H FIN had #V (300) 

V H ING hav ING1 #ING1 #V (400) 

V B SNG FIN is #V (500) 

V B SNG FIN 1 was #V (400) 

V B INF be #V (400) 

V B EN be EN1 #EN1 #V (600) 

V B ING be ING1 #ING1 #V (700) 

V B PL FIN are #V (300) 

V B PL FIN 1 were #V (500) 

V FIN wrote #V (166) 

V INF write #V (254) 

V INF 1 chew #V (138) 

V INF 2 walk #V (318) 

V INF 3 wash #V (99) 

V ING chew ING1 #ING1 #V (623) 

V ING 1 walk ING1 #ING1 #V (58) 

V ING 2 wash ING1 #ING1 #V (102) 

V EN made #V (155) 

V EN 1 brok EN1 #EN1 #V (254) 

V EN 2 tak EN1 #EN1 #V (326) 

V EN 3 lash ED #ED #V (160) 

V EN 4 clasp ED #ED #V (635) 

V EN 5 wash ED #ED #V (23) 
ING1 ing #ING1 (1883) 

EN1 en #EN1 (1180) 
ED ed #ED (818) 

Figure 11: A grammar for the syntax of English auxiliary verbs (Part 2). 
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Figure 12: The best alignment found by SP61 with 'it is wash ed' in New and the 
grammar from Figures HtH and fTTI in Old. 
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Figure 13: The best alignment found by SP61 with 'will it have be en brok en' in 
New and the grammar from Figures [TTJ1 and ITTI in Old (Part 1). 
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Figure 14: The best alignment found by SP61 with 'will it have be en brok en' in 
New and the grammar from Figures [TTJ1 and ITTI in Old (Part 2). 
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Figure 15: The best alignment found by SP61 with 'are they walk ing' in New and 
the grammar from Figures ITuI and ITTI in Old. 
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in the sentence pattern. The symbols 'XR ... #S' encode the remainder of the 
sequence of verbs. 

The first 'XI' pattern encodes verb sequences which start with a modal verb 
('M'), the second one is for verb sequences beginning with a finite form of the 
verb 'have' ('H'), the third is for sequences beginning with either of the two 
'B' verbs in the primary sequence (see below), and the last 'XI' pattern is for 
sentences which contain a main verb without any auxiliaries. 

In the first of the 'XI' patterns, the subsequence 'XR ... #S' encodes the 
remainder of the sequence of auxiliary verbs using the symbols 'XH XB XB 
XV. In a similar way, the subsequence 'XR ... #S' within each of the other 
'XI' patterns encodes the verbs which follow the first verb in the sequence. 

Notice that the pattern 'XI 2 XB1 FIN #XB1 #X1 XR XB XV #S' can 
encode sentences which start with the first 'B' verb and also contains the second 
'B' verb. And it also serves for any sentence which starts with the first or the 
second 'B' verb with the omission of the other 'B' verb. In the latter two cases, 
the 'slot' between the symbols 'XB' and 'XV is left vacant. FigurelT2l illustrates 
the case where the verb sequence starts with the first 'B' verb with the omission 
of the second 'B' verb. Figure IT51 illustrates the case where the verb sequence 
starts with the second 'B' verb (and the first 'B' verb has been omitted). 

7.2.4 The secondary constraints 

The secondary constraints are represented using the patterns 'M INF', 'H EN', 
'B XB INC and 'B XV EN'. Singular and plural dependencies are marked in a 
similar way using the patterns 'SNG SNG' and 'PL PL'. 

Examples appear in all three alignments in Figure El Figures El and El 
and Figure El In every case except one (row 4 in Figure 112ft , the patterns 
representing secondary constraints appear in the bottom rows of the alignment. 
These examples show how dependencies bridging arbitrarily large amounts of 
structure, and dependencies that overlap each other, can be represented with 
simplicity and transparency in the medium of multiple alignments. 

Notice, for example, how dependencies between the first and second verb in a 
sequence of auxiliary verbs are expressed in the same way regardless of whether 
the two verbs lie side by side (e.g., the statement in Figure IT2|) or whether they 
are separated from each other by the subject noun- phrase (e.g., the question 
in Figures IT3l and ITTI and the question in Figure EJ- Notice, again, how the 
overlapping dependencies in Figures 1131 and 1141 and their independence from 
each other are expressed with simplicity and clarity in the ICMAUS framework. 

Readers may wonder why the two patterns representing dependencies be- 
tween a 'B' verb and whatever follows it ('B XB ING' and 'B XV EN') contain 
three symbols rather than two. One reason is that, when two (or more) patterns 
begin with the same symbol (or sequence of symbols), the scoring method for 
evaluating alignments requires that the two patterns can be distinguished from 
each other by one (or more) symbols in each pattern which does not include the 
terminal symbol in each pattern. A second reason is that the second symbol 
in each pattern helps to determine whether the 'B' at the start of the pattern 
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corresponds to the first or the second 'B' verb in the primary sequence: 

• 'B XB INC. The inclusion of 'XB' in this pattern means that the 'B' verb 
is the first of the two 'B' verbs in the primary sequence and the following 
verb must be 'INC. 

• 'B XV EN'. The inclusion of 'XV in this pattern means that the 'B' verb 
may be the first or the second of the two 'B' verbs. However, since the first 
case is already covered by 'B XB INC, this pattern covers the constraint 
between the second 'B' verb and verbs of the category 'EN'. 

8 Discussion and Conclusion 

This section considers briefly a selection of topics relating to the development 
of these ideas. 

8.1 Other examples 

In the space available, it has not been possible to show more than a small selec- 
tion of examples. Additional example may be found in Wolff 98a, Wolff 98b 
showing: how the system can find alternative parsings when there are am- 
biguities in the text being parsed; how recursive structures in syntax can be 
parsed; how the provision of appropriate context can resolve ambiguities when 
discontinuous dependencies of one type are nested, one within another; and one 
possible way in which 'cross-serial dependencies' in syntax may be represented 
and parsed in the ICMAUS framework. 

Other examples showing how the system can handle recursive structures in 
syntax may be found in |Wolff 99~aj . 

8.2 Parsing and learning 

As was noted in Section 11.21 much of the thinking in this research programme 
is based on an earlier programme of research into the unsupervised learning 
of linguistic structures [Wolff 911 IWolff 881 IWoIff~82] . The ICMAUS framework 
and the SP61 model have been developed with the express intention that they 
should accommodate inductive learning and integrate it in a seamless manner 
with other capabilities of the model. 

It is envisaged that the framework will be developed so that, when New 
information is received that cannot be unified fully with patterns in Old, the 
patterns or parts of patterns in New which do not unify with existing patterns 
in Old will be simply added to Old with system-generated code symbols where 
appropriate. By hypothesis in this research programme, the process of adding 
New knowledge to Old in a manner which minimises redundancy (as far as is 
practically possible) will capture the essentials of unsupervised inductive learn- 
ing. 
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Development of the ICMAUS framework to incorporate inductive learning 
in this way is currently in progress. 

8.3 Potential advantages of using patterns to represent 
knowledge 

In this research, 'patterns' have been adopted as the medium for representing 
knowledge: 

• Because they seem to offer a good prospect of providing a 'universal' 
medium for representing diverse kinds of knowledge. 

• For similar reasons, 'patterns' seem to lend themselves to the representa- 
tion of knowledge at both 'concrete' and 'abstract' levels of abstraction. 

• For these reasons, the use of patterns may facilitate the seamless integra- 
tion of diverse kinds of knowledge over a wide range of abstractions. 

• Likewise, the use of patterns may facilitate the development of a learning 
system that can operate freely with diverse kinds of knowledge over a wide 
range of abstractions. 

8.4 Integration and generalisation 

If, as suggested in Section fl.2l both linguistic and non-linguistic structures may 
be accommodated naturally within the ICMAUS framework, then grammars 
of the kind shown previously may, at some stage, be extended seamlessly to 
include the 'meanings' of syntactic forms. Parsing and production of language 
as described here should generalise without radical reorganisation to a more 
rounded model of language understanding and production of language which 
includes meanings. 

In a similar way, the potential of the system noted in Section ll. 21 to accom- 
modate other aspects of 'intelligence' such as probabilistic and other kinds of 
reasoning, best-match pattern recognition and inductive learning suggests po- 
tential in the system for the eventual integration of natural language processing 
with non-linguistic 'intelligence' of various kinds. 

8.5 Conclusion 

In this article I have tried to show informally with examples how the represen- 
tation of natural language syntax and the parsing and production of natural 
language may be understood as ICMAUS. 

A novel feature of these proposals is the superficially paradoxical idea that a 
single process of information compression by multiple alignment, unification and 
search may achieve both the encoding and the decoding of information, both 
the analysis and the production of sentences. This is not simply a gimmick: in 
practical terms it offers the prospect that one search engine may be used for 
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both purposes and it offers a theoretical bonus in extending the explanatory 
range of the model without the need for any ad hoc additions or modifications. 

The suggested method of representing the syntax of natural language ap- 
pears to be simpler and more direct than existing methods. This method may 
have benefits in the creation of hand-crafted grammars for natural languages. 
Perhaps more significantly, it may simplify the automatic learning of grammars 
for natural languages which is envisaged in the further development of these 
ideas. 

In general, an important motivation for further development of these ideas 
is the potential which they offer for the integration of parsing and production 
of language with other aspects of computing including unsupervised learning, 
deductive and probabilistic inference, (fuzzy) pattern recognition, (fuzzy) infor- 
mation retrieval and others. In the broadest terms, the aim of this research 
programme and a touchstone for its success or failure is the development of a 
model which exhibits a favourable combination of conceptual simplicity with 
explanatory or descriptive power. 
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A Definitions of terms 
A.l 'Symbol' 

A symbol is some kind of mark which can be compared with any other symbol. 
In the context of pattern matching, a symbol is the smallest unit which can 
participate in matching: a symbol can be compared (matched) only with another 
single symbol and the result of matching is either that the two symbols are the 
same or that they are different. No other result is permitted. 

An important feature of the concept of a symbol, as it is used in this research, 
is that, with one qualification, it has no hidden meaning. In this research, a 
symbol is a primitive mark which can be discriminated in a yes/no manner 
from other symbols. There are no symbols like the symbols in an arithmetic 
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function (e.g., '6', '22', '+', ! x', '/', '(', ')' etc), each of which has a meaning 
for the user which is not directly visible. 

The one qualification to the slogan "no hidden meaning" is that it seems 
necessary to allow the system to make a distinction, relative to each pattern, 
between symbols that are 'code' for that pattern and symbols that are 'data' 
or 'contents' for the pattern. Labels like 'code' or 'data' reflect operations of 
the system itself (or some comparable system in the past) and may therefore 
be regarded as distinct from 'uscr-oricntcd' meanings that are intrinsic to the 
material being processed. 

For any given symbol (or group of symbols), it is possible to express mean- 
ings of this latter kind but those meanings must take the form of one or more 
additional symbols which are associated with the given symbol (or group of 
symbols) and are thus explicit and visible within the structure of symbols and 
patterns. 

A. 1.1 'Symbol type' and 'alphabet' 

If two symbols match, we say that they belong to the same symbol type. In any 
system which contains symbols, we normally recognise an alphabet of symbol 
types such that every symbol in the system belongs in one and only one of the 
symbol types in the alphabet, and every symbol type is represented at least 
once in the system. 

A.1.2 'Hit' and 'gap' 

A positive match between two symbols is termed a hit. In any given pattern in 
an alignment of two more patterns, one or more unmatched symbols between 
two hits in the pattern or before the first or after the last hit is termed a gap. 

A. 2 'Pattern' 

A pattern is an array of symbols in one, two or more dimensions. In this article, 
one dimensional patterns (sequences or strings of symbols) are the main focus 
of attention. 

The meaning of the term pattern includes the meanings of the terms substring 
and subsequence, defined next. 

A.3 'Substring' 

A substring is a sequence of symbols of length n within a sequence of length m, 
where n ^ m and where the constituent symbols in the substring are contiguous 
within the sequence which contains the substring. 

A. 4 'Subsequence' 

A subsequence is a sequence of symbols of length n within a sequence of length 
m, where n < m and where the constituent symbols in the subsequence may 
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not be contiguous within the sequence which contains the subsequence. The 
set of all subsequences of a given sequence includes all the substrings of that 
sequence. 

A. 5 'Alignment' 

In the case of one-dimensional patterns, 6 an alignment is a two-dimensional 
array of one or more sequences of symbols, each one in a separate row in the 
array. The alignment shows sets of two or more matching symbols by arranging 
the symbols in each set in a column of the array. 7 In an alignment, as defined 
in this research: 

• Symbols which are contiguous in a pattern which appears in an alignment, 
need not occupy contiguous cells in the array. 

• Any one pattern may appear zero or more times in an alignment. 

• Where a pattern appears two or more times in an alignment, no symbol 
in one appearance of the pattern should ever be shown as matching the 
same symbol in another appearance of the pattern. 

• Any symbol in one pattern may be placed in the same column as any 
other symbol from the same pattern or another pattern, providing order 
constraints are not violated. 

For any alignment, order constraints are preserved if the following state- 
ment is always true: 

For any two rows in the alignment, A and B, and any four symbols, 
Ai and A2 in A, and Bi and B2 in B, if Ai is in the same column 
as Bi, and if A2 is in the same column as B2, and if A2 in A follows 
Ai in A, then B2 in B must follow Bi in B. 

This condition holds when the two rows contain two different patterns and also 
when the two rows contain two appearances of one pattern. 

A. 6 'Mismatch' 

A mismatch in an alignment occurs when, between two columns in the align- 
ment containing hits, or between one column containing hits and the beginning 
or end of the alignment, there are no other columns containing hits and there 
are two more columns containing single symbols from two or more different 
patterns in Old. 

6 As previously noted, the concept of an alignment may be generalised to patterns of two or 
more dimensions. But no attempt is made here to provide a formal definition for alignments 
of patterns of two dimensions or higher. 

7 The fact that, in displaying alignments, it can sometimes be convenient to put non- 
matching symbols in the same column with lines to mark the symbols that do match (as in 
Figure^ is not relevant to the abstract definition of an alignment presented here. 
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B Evaluation of an alignment in terms of com- 
pression 

Section 0] described in outline how, in the ICMAUS scheme, an alignment is 
evaluated in terms of compression. This section provides more detail. 

As explained earlier, an alignment and its unification is interpreted as a 
means of encoding New or part of New in terms of patterns in Old. If New 
(the sentence to be parsed) or part of New is matched by a pattern in Old (the 
grammar) then code symbols from that pattern may be used as an abbreviated 
description of that part of New. 

By contrast with standard compression methods, each code serves a dual 
role: to identify the corresponding pattern uniquely within the grammar, and 
to mark the left and right ends of the pattern. For present purposes, the second 
role is required to remove the ambiguity which would otherwise exist about 
left-to-right sequencing of symbols in alignments. 

As noted in Section 0] the coding principle may be applied through two 
or more 'levels' so that the symbols which encode a sequence of two or more 
patterns at one level may themselves be recognised as an instance of a recurrent 
pattern which has its own code at the next higher level. Examples will be seen 
below. 

A key point in this connection is that a recurrent pattern may be discontin- 
uous in the sense that the symbols in the pattern are not necessarily contiguous 
as they appear in any or all of its occurrences. In other words, a recurrent pat- 
tern may appear as a subsequence within larger patterns. Thus, for example, 
a sequence of symbols like 'A B C D E F' may be recognised as a recurrent 
pattern within a set of instances which includes patterns like 'P A B Q C R D 
E F S\ 'A L B C D M N E F O P\ 'X A B C D Y E F Z' and so on. 

In what follows (Appendices IB. II to IB.5|) , I shall first give an informal ex- 
planation of the method of calculating the compression associated with any 
alignment using the example shown in Figure 0] Then the principles embodied 
in the method are discussed in Appendix IB. 61 and a formal summary of the 
method is presented in Appendix IB. 71 

B.l Encoding individual symbols 

The simplest way to encode individual symbols in the sentence and the grammar 
is with a 'block' code using a fixed number of bits for each symbol. In the 
grammar in Figure|21 there are 24 symbol types so the minimum number of bits 
required for each symbol is |7og224] = 5 bits per symbol. 

In fact, the SP61 model (described in Section |SJ) uses variable-length codes 
for symbols, assigned in accordance with the Shannon-Fano-Elias (S-F-E) cod- 
ing scheme (described by [Cover and Thomas 91p so that the shortest codes 
represent the most frequent symbols and vice versa. 

Notice that the number of bits required for each symbol is entirely indepen- 
dent of the number of characters in the name of the symbol as it is shown in the 
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examples. Names of symbols are chosen purely for their mnemonic value and 
to aid comprehension. 

There are many variations and refinements that may be made at this level 
but, in general, the choice of coding system for individual symbols is not critical 
for the principles to be described below where the focus of interest is the ex- 
ploitation of redundancy which may be attributed to sequences of two or more 
symbols rather than any redundancy attributed to unbalanced frequencies of 
individual symbols. 

For reasons which are given in Section 16.11 connected with the decoding 
of information, the code for each symbol has two different sizes (in bits): a 
'minimum cost' which is the theoretical minimum number of bits needed to 
represent that symbol according to the S-F-E calculations, and an 'actual cost' 
which is the (larger) number of bits that are needed to allow robust decoding 
of information as well as encoding. 

In the following informal description of the encoding principles, the distinc- 
tion between the 'minimum cost' and the 'actual cost' of each symbol is not 
important and will be ignored. For the sake of simplicity in this presentation, 
it will be assumed that all symbols are encoded with the same number of bits 
so that 'one symbol' can be treated as the minimum unit of information. 

B.2 Encoding words 

As explained in Section 0] a word like 'this' in the grammar shown in Figure 
[3] may be encoded as 'D #D', In a similar way, the word '1 o v e s' may be 
encoded as 'V #V and likewise for the other words. In all cases except 'b o 
y', there is a modest saving of one or two symbols for each word. 

B.3 Encoding phrases 

Consider the phrase 't h i s b o y'. If this were encoded with a code pattern 
for each word, the result would be 'D #D N 1 #N' which is only one symbol 
smaller than the original. However, we can encode the phrase with fewer symbols 
by taking advantage of the fact that the sequence 'D #D N 1 #N' has a 
subsequence, 'D #D N #N', which is a substring within the pattern 'NP D 
#D N #N #NP' in the grammar. Notice that the sequence 'D #D N #N' 
is discontinuous within the sequence 'D #D N 1 #N' in the sense described 
earlier. 

Since the 'noun phrase' pattern 'NP D #D N #N #NP' is in the grammar, 
we may replace the substring, 'D #D N #N', by the 'code' sequence 'NP #NP'. 
But then, to encode the two words within the noun phrase ('t h i s' and 'b o 
y'), we must add the symbols, '0' and '1' from 'D #D N 1 #N' so that the 
final coded sequence is 'NP 1 #NP'. 

Notice how the symbols 'NP' and '#NP' in the code pattern 'NP 1 #NP' 
serve as a disambiguating context so that the symbol '0' identifies the pattern 
'D t h i s #D' and the symbol T' identifies the pattern 'N 1 b o y #N'. The 
overall cost of the code pattern 'NP 1 =#=NP' is 4 symbols compared with the 
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original 7 symbols in 't h i s b o y' - a saving of 3 symbols. In a similar way, 
the phrase 't h a t g i r 1' may be encoded as 'NP 1 #NP' which is 4 symbols 
smaller than the original. 

B.4 Encoding the sentence 

Given the two noun phrases in their encoded forms ('NP 1 #NP' for 'this 
boy' and 'NP 1 #NP' for 't h a t g i r 1') and the encoding of '1 o v e s' as 
'V #V\ the whole sentence may be encoded as 'NP 1 #NP V #V NP 1 
#NP'. 

However, this sequence contains the subsequence 'NP #NP V #V NP #NP' 
and this sequence is a substring within the 'sentence' pattern 'S NP #NP V 
#V NP #NP #S' - and this pattern is in the grammar. So we may replace 
the sequence 'NP #NP V #V NP #NP' by the 'code' sequence 'S #S\ To 
discriminate the words in this sentence we must add the symbols '01010' 
from the sequence 'NP 1 #NP V #V NP 1 #NP'. The overall result is an 
encoded representation of the sentence as: 

S 1 1 #S. 

The 7 symbols in this encoding of the sentence represents a substantial 
compression compared with the 20 symbols in the unencoded sentence. 

B.5 Taking account of the sizes of gaps 

The account of pattern matching and coding in Sections IB. 31 and IB . 41 illustrates 
the way in which 'matching' in the proposed scheme embraces the matching 
of subsequences (where the matched symbols need not be contiguous) as well 
as the more traditional matching of coherent substrings (where the matched 
symbols are always contiguous, one with the next). 

In this connection, most people have a strong intuition that, where there are 
gaps in matching, small gaps or no gaps are 'better' than large ones. It seems 
that our intuitions in this area can be justified in terms of probability theory. 
A method, based on probability principles, for making allowances for gaps has 
been developed and is applied in the SP61 model. A brief outline of the method 
and how it is applied is presented in Sect ion IB. 71 below. 

B.6 Discussion 

Each pattern expresses sequential redundancy in the data to be encoded and 
this sequential redundancy can be exploited to reduce the number of symbols 
which need to be written out explicitly. In the grammar shown in Figure |3 
each pattern for an individual word expresses the sequential redundancy of the 
letters within that word; the pattern for a noun phrase expresses the sequential 
redundancy of 'determiner' followed by 'noun'; and the pattern for a sentence 
expresses the sequential redundancy of the pattern: 'noun phrase' followed by 
'verb' followed by 'noun phrase'. 
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Since this principle operates at all levels in the 'hierarchy' of patterns, many 
of the symbols at intermediate levels may be omitted completely. A sentence 
may be specified with symbols marking the start and end of the sentence pattern 
together with interpolated symbols which discriminate amongst alternatives at 
lower levels. 

Notice that these ideas are only applicable to alignments which can 'project' 
into a single sequence of symbols, as is the case with the alignment shown in 
Figure Any alignment like this: 

a x b a b x 

I I or this I I 
a y b a b y 

where there is a 'mismatch' of symbols, cannot be evaluated in this way. For 
present purposes, any such alignment is excluded from consideration. When 
the SP model is generalised to other areas such as learning, it is intended that 
alignments like those just shown will be evaluated alongside those which can 
project without mismatches. 

The method that has been described illustrates the role of context in the 
encoding of information. Any one symbol like '0' or '1' is ambiguous in terms of 
the patterns in the grammar in Figure |2| But in the context of the pattern 'S 
10 10 #S' and the same grammar, it is possible to assign each instance of '0' or 
'1' unambiguously to one of the words in the grammar, giving the sequence of 
words in the original sentence. It appears that ICMAUS provides a mechanism 
for 'decoding' the encoded form of the sentence, as discussed in Sectional 

B.7 Summary of method for calculating the compression 
associated with an alignment 

The proposed method of calculating the compression difference (CD) associated 
with an alignment of patterns is summarised in more formal terms here. This 
is the method embodied in the SP61 model (which is described in Section El 
and Appendix 0. The method is designed to calculate the compression of New 
information or part of it (all or part of the sentence to be parsed) which may 
be achieved by 'encoding' New information in terms of Old information (where 
Old information is the patterns of symbols representing the grammar used in 
parsing). This CD is calculated as: 

CD = Bn — Be , 

where Bn is the number of bits required to represent the hit symbols in New 
without any encoding (except S-F-E coding at the level of single symbols), and 
Be is the number of bits required for the encoding of those same symbols from 
New in terms of Old information. How these values are calculated is described 
below. 
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B.7.1 Information costs of symbols 

If a simple block code is used for symbols, then the 'minimum cost', M, for each 
symbol is 

M=\log 2 \S\] 

bits where |iS| is the number of symbol types in the alphabet of symbol types 
(S) used throughout New and Old. 

As previously noted, the value of M for each symbol type (and thus each 
individual symbol) is calculated in SP61 by the S-F-E method. For any one 
symbol type, the input for this calculation is the frequency of occurrence of the 
symbol type either measured directly or approximated using this formula: 

p 

fat = ^2ifiXO l ) 

i=l 

where fi is the (notional) frequency of the ith pattern in the grammar (illus- 
trated by the numbers on the right of Figure|3J), o; is the number of occurrences 
of the given symbol in the ith pattern and P is the number of patterns in the 
grammar. 

Whichever way the value of M is calculated, the 'actual cost', A, of each 
symbol is: 

A = M X c, 

where c is a factor whose size is not critical except that c > 1. 

B.7.2 Calculation of E, the minimum number of bits required for 
the encoding of a given pattern in Old 

The calculation of Be for any alignment requires a value for the 'encoding cost', 
E, for each pattern from Old which appears in the alignment. 

Since there is a frequency of occurrence associated with each pattern in any 
grammar, it is possible to calculate a theoretical minimum for the value of E for 
each pattern using the S-F-E method. However, there is an alternative method 
of calculating E which, for present purposes, appears to be more useful and 
which has been adopted in the SP61 model described in Section [5] 

In summary, the alternative method is to calculate E as 

n 

E = J2 D i 

i=l 

where Di is the M value for the ith symbol in a subsequence of n 'discrimination' 
symbols within the given pattern which identifies the pattern uniquely amongst 
the patterns in the grammar without over-specifying the pattern. 
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Ideally, the discrimination symbols for a pattern would be whatever sub- 
sequence of the pattern was most distinctive of the pattern, regardless of the 
position of the symbols within the pattern. However, in the SP61 model, two 
constraints have been imposed: 

• The simplifying assumption has been made that the discrimination sym- 
bols are the smallest substring of one or more symbols starting at the be- 
ginning of the pattern which enables the pattern to be identified uniquely 
within the grammar. For any pattern, it is easy to discover what this 
substring is by a process of systematic comparison of candidate substrings 
with corresponding symbols in other patterns in the grammar. 

Although a constrained subsequence of symbols is used in calculating the 
value of E for the pattern, this does not mean that a pattern can only 
ever be recognised by those symbols and no others. In the SP61 model, 
a pattern can be fully or partially recognised by any subsequence of its 
symbols. 

• Whenever a pattern ends in a 'termination' symbol (a symbol whose first 
character is the hash character ('#')), this symbol is added to the set of 
discrimination symbols for the pattern if it is not otherwise there. 

B.7.3 Calculation of Bm (the number of bits required to represent 
hit symbols from New in 'raw' form) 

For any one alignment, Bjy is calculated as: 

h 

i=l 

where Ai is the 'actual cost' of the symbol corresponding to the ith hit in 
a sequence of hits, H\...Hh, with an adjustment to be described in the next 
paragraph. The hit sequence H\...Hh comprises the hits between symbols in 
New and symbols in patterns in Old. The symbols from New in this hit sequence 
are a subsequence of the sequence Ni...N n , which is the pattern in New. 

B.7.4 Allowing for gaps 

Before the formula, above, is applied, the value of each Ai is adjusted to take 
account of any 'gap' which may exist between the given hit and any previous 
hits in the sequence of hits between New and patterns in Old. For this purpose, 
the alignment is treated as if it were two sequences of symbols: the sequence of 
symbols which is New (the sentence being parsed) and the sequence of symbols 
which is the projection of the alignment into a single sequence. 

As indicated above, there is insufficient space to present fully the method 
of allowing for gaps. In outline, it is based on an analogy with the rolling of 
two ^4-sided dice, where A is the size of the alphabet used in New and Old. 
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The sequence of rolls of one die corresponds with the sequence of symbols in 
New and the sequence of rolls of the other die corresponds with the sequence of 
symbols in the projection. The method is based closely on the method described 
in |Lowry 89] for calculating probabilities of various contingencies in problems 
of this type. 

For the symbol corresponding to the ith hit in the sequence Hi...Hh, the 
adjusted value of Ai is calculated as: 

Ai = ^ x F s 

where ai is the actual cost of the symbol corresponding to the ith hit in H\...Hht 
and F s is the sth entry in a table of 'scaling factors' which is calculated at the 
outset of processing. The value of i 7 ! is always 1. For each hit in H\...Hh after 
the first, the variable s (which represents the 'span' between the current hit in 
Hi...Hh and the preceding hit) is calculated as: 

s=(P i -P i - 1 ) x (Ci-Q-i) 

where Pi is the position in N\...N n of the symbol corresponding to the ith 
hit in H\...Hh, Pi-i is the position in Ni...N n of the symbol corresponding to 
the (i — l)th hit in H\...Hh. Ci and Cj_i are the analogous positions in the 
projection of the alignment into a single sequence - which means that Ci and 
Cj_i represent columns in the alignment itself. 

B.7.5 Calculation of Be (the number of bits required to encode the 
hit symbols from New) 

For each new alignment, the value of Be is: 

r 

Be = /J Ei — S 

i=2 

where Ei is the 'encoding cost' of the Old pattern appearing on one of r rows of 
the alignment other than the top line (where New appears) and S is the saving 
in encoding costs arising from the fact that some patterns in the alignment 
convey information about the sequential arrangement of other patterns in the 
alignment or the selection of other patterns in the alignment where alternatives 
are possible in a given context. 

The 'encoding cost' of any pattern is the value of E for that pattern, calcu- 
lated as described in Appendix IB. 7. 21 Notice that if any pattern appears two or 
more times in the alignment, its encoding cost is added a corresponding number 
of times to the sum of encoding costs. 

The calculation of Be depends on three main ideas: 

• As previously noted, a pattern may be fully or partially recognised by 
any subsequence of the pattern. In other words, it is not necessary to 
use the specific symbols which were used in calculating the value of E 
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for that pattern. As a general rule when the grammar is largely free of 
redundancy, if the M values of the relevant symbols (adjusted for gaps - 
see next) add up to the value of E then that subsequence of symbols will 
identify the pattern uniquely amongst the other patterns in the grammar. 
Where there is redundancy in the grammar, more bits may be needed to 
achieve unique identification of a pattern. 

• If there are gaps in a sequence of hits, information values must be reduced 
in accordance with the rules used in calculating the value of Bn (Appendix 

EH. 

• For present purposes, there is nothing to be gained by over-specifying 
a pattern. If one pattern matches a second pattern by the minimum 
number of symbols needed to achieve unique identification of that second 
pattern, then the saving in encoding costs from this source is maximal. 
Any additional hits between the two patterns do not give any additional 
saving in encoding costs. 

Be is calculated in the following way: 

1. For each row (i?) in the alignment corresponding to a pattern from Old, 
create a variable (V) containing the value of E for the pattern in that row. 

2. Traverse the alignment from left to right examining the columns containing 
two or more symbols (including symbols in New). Any such column is 
designated a 'hit' column (Ch)- 

3. For each Ch which contains two or more symbols from patterns in Old 
(which we may designate Cho), examine each row which has a hit symbol 
from Old in the column (designated Rho)- For this symbol, calculate Ma, 
an 'adjusted' value of M for the symbol, taking account of any gap which 
may exist between the given Cho an d any previous Cjj. The method of 
making the adjustment is the same as is used for calculating the value 
of Bn ^Section |B.7.4|I except that, for each Rho, the gaps (or spans) 
are measured as if all the rows in the alignment except the given Rho 
is treated as if it were a single pattern to which the pattern in the given 
RHO is aligned. As in the calculation of J3/v, it is assumed that there is 
no gap associated with the first Ch for any given pattern. 

4. For each Cho, examine each Rho and, amongst these rows, identify the 
'leading' row, Rhol, whose pattern starts furthest to the left in the align- 
ment (if there is a tie, make an arbitrary choice amongst the ties). For 
example, in Figure 01 for either of the two columns which contains a hit 
between 'D' in 'D t h i s #D' and 'D' in 'NP D #D N #N #NP\ the 
Rhol is the one containing 'NP D #D N #N #NP' (row 7 in the first case 
and row 2 in the second case); for either of the two columns containing 
a hit between 'NP' in 'NP D #D N #N #NP' and 'NP' in 'S NP #NP 
V #V NP #NP #S', the Rhol is the row containing 'S NP #NP V #V 
NP #NP #S' (row 5 in both cases). 
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5. For each Cho, consider, in turn, each Rho, excluding the Rhol- For 
each row considered, subtract the value of Ma from the value of V for 
that row. If the new value of V is less than 0, V is set to and no further 
subtraction from that instance of V is allowed. 

6. When all relevant columns have been examined and the values of the V 
variables have been reduced, calculate 

r 

B E =Y. V i 

i=2 

where r is the number of rows in the alignment and the summation ex- 
cludes the top line (which contains New). 

The rationale for this method of calculating Be is that it gives us the sum 
of the E values of the patterns from Old corresponding to each row of the 
alignment after the first, with a reduction for hits between those patterns (with 
an adjustment for gaps as outlined above). 

The reason for reducing the value of Be when there are hits between patterns 
in Old is that any such hit reflects a degree of 'coverage' of one pattern from Old 
by another such pattern. To the extent that one pattern provides information 
that also exists in another pattern there is a reduced need for the second pattern 
to be identified in the encoding. In the extreme case, where two patterns are 
identical, only one of them need be identified in the encoding. As indicated 
above, any saving in encoding costs resulting from the coverage of one or more 
patterns by another cannot exceed the E value for each pattern - any additional 
hits are 'wasted'. Hence, the V value for any row cannot be reduced below 0. 

In the method described above, the 'leading' row for any one column (Rhol) 
is regarded as the row with which the other symbols in the column are unified. 
Hence, for the given column, this is the row where the V value is not reduced by 
the value of Ma- Intuitively, the left-to-right bias in the definition of 'leading 
row' is less theoretically 'clean' than if all concepts were entirely symmetrical 
between left and right directions in the alignment. However, the concepts as 
described are the best to have been found so far and seem to work quite well. 

C The organisation and operation of the SP61 
model 

FigurelTBlpresents a high level view of the organisation of the SP61 model using 
pseudocode while FigureslTTIandlTHlshow. with pseudocode, the first and second 
parts of the compress () function within the model. The text below describes how 
the model works together with details of its organisation that are not included 
in the pseudocode. 
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mainO 
{ 

1 Read the rules of the grammar, each one with a frequency 

of occurrence in a notional sample of the language, 
and store the patterns with their frequencies in Old. 

2 Read the sentence to be parsed and store it in New. 

3 Derive a frequency for each symbol in the grammar 

(as described in Appendix B) . 

4 Using the frequencies of the symbols with the method, 

assign to each symbol in New and Old a number 
of bits representing the 'minimum' information 'cost' 
of that symbol. Also, calculate an 'actual' 
information cost for each symbol. 

5 For each pattern in the grammar, calculate E, the 

minimum number of bits needed to encode that pattern. 

6 Select the sentence to be parsed and add it as the first 

'driving pattern' to an otherwise empty list of 
driving patterns . 

7 while (new alignments are being formed) 

compress () 

8 Out of all the new alignments which have been formed, 

print the ones with the best CDs. 

} 

Figure 16: A high level view of the organisation of the SP61 model. 
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compress () 
{ 

1 Clear the 'hit structure' (described in the text). 

2 while (there are driving patterns that have not 

yet been processed) 

{ 

2.1 Select the first or next driving pattern 

in the set of driving patterns. 

2.2 while (there are more symbols in the 

current driving pattern) 

{ 

2.2.1 Working left to right through the 
current driving pattern, select the 
first or next symbol in the pattern. 

2.2.2 'Broadcast' this symbol to make a 
yes/no match with every symbol in the 
'target patterns' in Old. 

2.2.3 Record each positive match (hit) in a 
'hit structure' (as described in the 
text). As more symbols are broadcast, 
the hit structure builds up a record 
of sequences of hits between the 
driving pattern and the several target 
patterns in Old. As each hit sequence 
is extended, the compression score of 
the corresponding alignment is 
estimated using a 'cheap to compute' 
method of estimation. 

2.2.4 If the space allocated for the hit 
structure is filled at any time, the 
system 'purges' the worst hit sequences 
from the hit structure to release more 
space. The selection uses the estimates 
of compression scores assigned to each 
hit sequence in Step 2.2.3. 

} 

} 

Figure 17: First part of the compress() function of the SP61 model. 
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3 For each hit sequences which has an estimated 

compression score above some threshold value 
and which will 'project' into a single 
sequence (as described in the text) , convert 
the hit sequence into the corresponding 
alignment . Discard this alignment if it is 
identical with any alignment already in Old. 
Otherwise, compute the compression score using 
the method described in Appendix C, print 
the new alignment and add it to Old. If 
no new alignments are formed, quit the 
compressO function. 

4 Excluding the original patterns in Old, examine 

all the alignments that have been added to Old 
since the beginning of processing and choose 
a subset of these alignments using the method 
described in the text . Remove from Old all the 
alignments which have not been selected. The 
original patterns are never removed from Old. 

5 Clear the list of driving patterns and then, using 

the same method as is used in 4 but (usually) 
with a more restrictive parameter, select a 
subset of the alignments remaining in Old and 
add references to those alignments to the list 
of driving patterns (these patterns are not 
removed from Old and may therefore also be 
target patterns on the next cycle) . 

Figure 18: Second part of the compress() function of the SP61 model. 
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C.l Preliminary processing 

C.l.l Calculation of the information cost of each symbol 

As was described in Section 13.11 each rule in the grammar has an associated 
frequency of occurrence in (a 'good' parsing of) some notional sample of the 
language. In Step 3 of main() in Figure [TBI the model derives the frequency of 
occurrence of each symbol type as described in Appendix IB. 7. II 

These frequencies are then used (in Step 4 of main(J) to calculate the min- 
imum number of bits needed to represent each symbol type using the S-F-E 
coding scheme (see [Cover and Thomas 91| 1. as described in Appendix IB. 7. II 
The resulting sizes for each symbol type are then assigned as 'minimum cost' 
sizes to corresponding symbols in New and Old. Each symbol in New and Old is 
also given an 'actual cost' which is the minimum cost increased by an arbitrary 
factor, rounded up to ensure that the actual cost is at least one bit larger than 
the minimum cost (see Section lB.7.1ll . 

C.l. 2 Establishing the encoding cost of each pattern in Old 

In Step 5 of mainQ in Figure ITHl each pattern in the grammar is assigned a 
minimum number of bits required to discriminate the pattern from other pat- 
terns in the grammar using frequencies of the patterns with the S-F-E method, 
as was outlined in Section TB. 7. 21 

C.2 Building the 'hit structure' (step 2 of the compress() 
function in Figure I17|) 

The compress() function shown in Figures IT7I and IT81 is the heart of the SP61 
model. This subsection and the ones that follow supplement the description in 
the figure. 

As can be seen from the figure and inferred from the outline description in 
Sectional the compress() function is applied iteratively. On the first cycle, the 
'driving' pattern is simply the sentence to be parsed. On subsequent cycles, the 
list of driving patterns is a subset of the alignments formed in preceding cycles. 
Iteration stops when no new alignments can be found which satisfy conditions 
described below. 

C.2.1 Fuzzy matching of one pattern with another 

Step 2 of the compress() function is based on the central process in SP21 
Wolff 94a , a process which is related to dynamic programming (DP, 
Wag ner and Fischer 74| ) and is designed to find 'fuzzy' matches which are 
'good' between one 'driving' pattern and one or more 'target' patterns. In this 
context, a 'fuzzy' match is one where only a subsequence of the symbols in one 
pattern need match the symbols in the other pattern and vice versa. 

The technique is to 'broadcast' each symbol in the driving pattern to make 
a yes/no match with each symbol in the set of target patterns and to record 
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sequences of hits in a 'hits structure'. Each sequence of hits (termed a hit 
sequence) represents an alignment between the driving pattern and one of the 
target patterns. 

As is described in Wolff 94a , the hit structure has the form of a list- 
processing tree with each node representing a hit and each path from the root 
to a leaf node representing a sequence of hits. 

C.2.2 No one instance of a symbol should ever be matched with 
itself 

Since driving patterns can also be target patterns, any one pattern may be 
aligned with itself. That being so, a check is made to ensure that no instance of 
a symbol is ever matched against itself (see Section l2~2l . Obviously, any such 
match would be meaningless in terms of the identification of redundancy. 

Since any symbol in the driving pattern and any symbol in the target pattern 
may have been derived by the unification of two or more other symbols, a check 
is also made to exclude all hits where the set of symbols from which one of the 
hit symbols was derived has one or more symbols in common with the set of 
symbols from which the other hit symbol was derived. In short, while any given 
pattern from the grammar may appear two or more times in one alignment, no 
symbol in any of the original patterns in Old ever appears in the same column 
as itself in any alignment. 

C.2.3 The order of symbols in New must be preserved 

As the matching process has been described so far, it would be entirely possible 
for the system to align a pattern like 'NP D 1 t h a t #D N 1 g i r 1 #N #NP' 
in the example considered earlier with the first 'NP #NP' in a pattern like 'NP 
#NP V 1 o v e s #V NP #NP #S' from the same example and to align 'NP 
D0this#DN0boy#D #NP' with the second 'NP #NP'. To avoid the 
formation of alignments like this which violate the order of the symbols in New, 
the system makes checks to ensure, at all stages, that the order of the symbols 
in New is honoured. 

C.2.4 Estimation of compression scores 

While the hit structure is being built, the compression score for the alignment 
corresponding to each hit sequence may be calculated at every stage but only at 
the cost of a lot of processing which would slow the model down. Consequently, a 
simple method of estimating the compression score is used in Step 2.2.3 of Figure 
1171 which is computationally 'cheap'. Although it gives results which do not 
correspond exactly with the values calculated using the formulae presented in 
Appendix IbI the differences appear not to be critical for the purposes of purging 
the hit structure (Step 2.2.4 in Figure IT7I Appendix IC. 2. 5|) or determining the 
threshold for converting hit sequences into alignments (Step 3 in Figure 1161 
Appendix IC.3|) . 
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C.2.5 Purging the hit structure 

If the space allocated to the hit structure is exhausted at any time, the hit struc- 
ture is 'purged' or, more literally, 'pruned' to remove branches corresponding to 
the worst 50% of the hit sequences (where the meaning of 'worst' is determined 
using the estimates of compression scores calculated in Step 2.2.3 of the com- 
press () function). In this way, space is released in which new sequences of hits 
can be stored. 

C.2.6 Distinctive features of the technique 

The technique of recording hits in a tree using list processing, coupled with 
the mechanism for purging the hit structure whenever the available space is 
filled, is probably the most important difference between the SP21 technique for 
finding partial matches and the more traditional kinds of DP. In the SP21/SP61 
technique: 

• Both strings being compared can be arbitrarily long. 

• The 'depth' of searching can be controlled by varying the space available 
for the hit structure: larger spaces give better results than smaller ones. 

• Unlike standard DP algorithms, the system delivers a set of alternative 
alignments between two sequences rather than a single 'best' alignment. 

C.3 Building, scoring and selection of alignments 

C.3.1 Building alignments and scoring them (step 3 of the compress() 
function) 

When the hit structure for a set of driving patterns has been built, the best 
hit sequences are converted into the corresponding alignments, excluding all 
alignments which will not 'project' on to a single sequence (as described in 
Section l3~2*l and Appendix IB. 6(1 and excluding alignments as described in lC.3.21 

The process of converting a hit sequence into an alignment achieves two 
things: it creates a one-dimensional sequence of symbols which is a unification 
of the driving pattern or patterns with the target pattern and it creates a two- 
dimensional array representing the alignment itself. For each alignment, the 
array occupies a portion of memory of exactly the right size, allocated dynami- 
cally at the time the alignment is formed. 

The one-dimensional sequence may enter into matching and unification in 
later iterations of the compress() function, while the two-dimensional array al- 
lows the full structure of the alignment to be seen and can be used in later checks 
to ensure that no instance of a symbol is ever matched with itself fSection lC.2.2|l 
and to ensure that the order of symbols in New is not violated ( Section IC. 2. 3|) . 

From time to time, identical alignments are formed via different routes. The 
program checks each of the newly-formed alignments against alignments already 
formed. Any alignment which duplicates one already formed is discarded. The 
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process of comparing alignments is indifferent to the order (from top to bottom) 
in which patterns appear in the alignment (cf. Section 13.21 above). 

Every new alignment which survives the several hurdles is added to Old and 
its CD is computed using the method and formulae described in Appendix iBl 

C.3.2 Selection of alignments: a quota for each hit symbol in New 

Apart from purging the hit structure when space is exhausted, the main way 
in which the SP61 model narrows its search space is a two-fold selection of 
alignments at the end of every cycle of the compress () function: 

• Excluding all the original patterns in Old, the program examines the align- 
ments which have been added to Old since the start of processing and 
selects a subset by a method to be described. All the other alignments are 
removed from Old and discarded. 

• Using the same method, the program selects a subset of the alignments 
which remain in Old to be used as driving patterns on the next cycle. 
These alignments are not removed from Old so they may also function as 
target patterns. 

At first sight it seems natural to select alignments purely on the basis of 
their compression scores. However, it can easily happen that, at intermediate 
stages in processing, the best alignments are trivial variations of each other 
and involve the same subset of the symbols from New. If selection is made by 
choosing alignments with a CD above a certain threshold, the alignments which 
are chosen may all involve the same subset of the symbols in New, while other 
alignments, containing symbols from other parts of New, may be lost. If this 
happens, the model cannot ever build an alignment which contains all or most 
the symbols in New and may thus never find the 'correct' answer. 

A solution to this problem which seems to work well is to make selections in 
relation to the symbols in New which appear in the alignments. Each symbol in 
New is assigned a 'quota' (the same for all symbols) and, for each symbol, the 
best alignments up to the quota are identified. Any alignment which appears in 
one or more of the quotas is preserved. All other alignments are purged. The 
merit of this technique is that it can 'protect' any alignment which is the best 
alignment for a given subsequence of the symbols in New (or is second or third 
best etc) but which may, nevertheless, have a relatively low CD compared with 
other alignments in Old. 

C.4 Processing New in Stages 

A feature of the SP61 model that, to avoid clutter, has been omitted from Figure 
1161 is that New may be divided into 'windows' of any fixed size (determined by 
the user) and the model can be set to process New in stages, one window at 
a time, from left to right. This feature of the model was introduced for two 
reasons: 
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• It seems to bring the model closer to the way people seem to operate, 
processing sentences stage by stage as they are heard or read, not waiting 
until the whole of a sentence has been seen before attempting to analyse 
it. 

• Since it is possible to discard all but the best intermediate results at 
the end of each window, this mode of processing has the advantage of 
reducing peak demands for storage of information and it also has the 
effect of reducing the size of the search space. 
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